Abstract. We develop an efficient parallel algorithm for answering shortest-path
queries in planar graphs and implement it on a multi-node CPU/GPU
cluster. The algorithm uses a divide-and-conquer approach for decomposing
the input graph into small and roughly equal subgraphs and constructs
a distributed data structure containing shortest distances within
each of those subgraphs and between their boundary vertices. For a planar
graph with $n$ vertices, that data structure needs $O(n)$ storage per
processor and allows queries to be answered in $O(n^{1/4})$ time.
1 Introduction

Finding shortest paths (SPs) in graphs has applications in transportation, social
network analysis, network routing, and robotics, among others. The problem
asks for a path of shortest length between one or more pairs of vertices. There
are many algorithms for solving SP problems sequentially. Dijkstra’s algorithm [2]
finds the distances between a source vertex $v$ and all other vertices of the graph
in $O(m \log n)$ time, where $n$ and $m$ are the numbers of vertices and edges
of the graph, respectively. It can also be used to efficiently find the distance
between a pair of vertices. This algorithm is nearly optimal (within a logarithmic
factor), but has an irregular structure, which makes it hard to implement efficiently
in parallel. The Floyd-Warshall algorithm, on the other hand, finds the distances
between all pairs of vertices of the graph in $O(n^3)$ time, which is efficient for dense
($m = O(n^2)$) graphs. It has a regular structure well suited to parallel implementation,
but is inefficient for sparse ($m = O(n)$) graphs such as planar graphs.
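
To make the contrast concrete, the two classical algorithms can be sketched in a few lines of Python. This is only an illustrative sketch and not the code used in our experiments; the adjacency-dictionary representation and the names `dijkstra`, `floyd_warshall`, and `example` are assumptions made for the example. The heap-based loop shows the irregular, data-dependent control flow of Dijkstra’s algorithm, while the triple loop shows the regular structure of Floyd-Warshall.

```python
import heapq
import math


def dijkstra(adj, source):
    """Distances from source to every vertex; adj maps u -> {v: weight}."""
    dist = {u: math.inf for u in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path to u was already found
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def floyd_warshall(adj):
    """All-pairs distances computed with the regular O(n^3) triple loop."""
    verts = list(adj)
    dist = {u: {v: math.inf for v in verts} for u in verts}
    for u in verts:
        dist[u][u] = 0.0
        for v, w in adj[u].items():
            dist[u][v] = min(dist[u][v], w)
    for k in verts:
        for i in verts:
            for j in verts:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Small undirected example: a 4-cycle with one heavy edge (made-up weights).
example = {
    0: {1: 1.0, 3: 5.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 0: 5.0},
}
```

On this example both functions agree that the distance from vertex 0 to vertex 3 is 3 (via vertices 1 and 2) rather than the weight 5 of the direct edge.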

In this paper we consider the query version of the problem. It asks
to construct a data structure that allows any subsequent distance
query to be answered quickly. A distance query asks, given an arbitrary pair of
vertices $v, w$, to compute $\mathrm{dist}(v, w)$. This problem has applications in
web mapping services such as MapQuest and Google Maps. There is a tradeoff between
the size of the data structure and the time for answering a query. For instance,
Dijkstra’s algorithm gives a trivial solution of the query version of the SP problem
with (small) $O(n + m)$ space (for storing the input graph), but large $O(m \log n)$
query time (for running Dijkstra’s algorithm with the first query vertex as the
source). On the other end of the spectrum, the Floyd-Warshall algorithm can be used
to construct a (large) $O(n^2)$ data structure (the distance matrix) allowing (short)
$O(1)$ query time (retrieving the distance from the stored matrix). However, for very
large graphs, the $O(n^2)$ space requirement is impractical. We are interested in an
algorithm that needs significantly less than $O(n^2)$ space, but answers queries
faster than Dijkstra’s algorithm. Our algorithm uses the structure of planar
graphs for increased efficiency, as most road networks are planar or near-planar,
and is also highly parallelizable, making use of the features available in
modern high-performance clusters and specialized processors such as GPUs.

The query version of the shortest path problem in planar graphs was proposed
in [3], and different aspects of the problem have since been studied by multiple
authors. Here we present the first distributed implementation for
solving the problem that is designed to make use of the potential for parallelism
offered by GPUs. Our solution makes use of the fast parallel algorithm for computing
shortest paths in planar graphs from [4], resulting in an implementation that is
asymptotically faster and is also shown to be efficient in practice.


2 Preliminaries 


Given a graph $G$ with a weight $\mathrm{wt}(e)$ on each edge $e$, the length of a path $p$ is
the sum of the weights of the edges of the path. The single-pair shortest path
problem (SPSP) is, given a pair $v, w$ of vertices of $G$, to find a path between
$v$ and $w$, called a shortest path (SP), with minimum length. The length of that
path is called the distance between $v$ and $w$ and is denoted as $\mathrm{dist}(v, w)$. For any
subgraph $H$ of $G$, the distance between $v$ and $w$ in $H$ is denoted as $\mathrm{dist}_H(v, w)$.
The single-source shortest path problem (SSSP) is to find SPs from a fixed vertex
$v$ to all other vertices of $G$. Finally, the all-pairs shortest path problem (APSP)
is to find SPs between all pairs of vertices. There are distance versions of SPSP,
SSSP, and APSP, which are more commonly studied, where the objective is to
compute the corresponding distances instead of SPs. Most distance algorithms
allow the corresponding SPs to be retrieved in additional time proportional to
the number of edges of the path. In this paper, by SPSP, SSSP, and APSP
we mean the distance versions of these problems.

A $k$-partition $\mathcal{P}$ of $G$ is a set $V_1, \dots, V_k$ of subsets of $V(G)$, the set of the
vertices of $G$, such that $V_i \cap V_j = \emptyset$ if $i \neq j$ and $\bigcup_i V_i = V(G)$. We call the
subgraphs of $G$ induced by the sets $V_i$ the components of $\mathcal{P}$. The boundary of the partition
consists of all vertices of $G$ that have at least one neighbor in a different
component. We denote by $BG(G)$ or simply by $BG$ the subgraph of $G$ induced by
the boundary vertices. For any $C \in \mathcal{P}$, we denote by $B(C)$ the set of all boundary
vertices that are from $C$. For any planar graph of $n$ vertices and bounded ($O(1)$
as a function of $n$) vertex degree one can find in $O(n)$ time a $k$-partition $\mathcal{P}$ with
$|B(C)| = O(\sqrt{n/k})$ for each component $C \in \mathcal{P}$.
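
The definition of the boundary can be illustrated with a short Python sketch. It assumes that a partition is already given as a mapping from vertices to component indices; the names `boundary_vertices`, `boundary_of_component`, `path`, and `parts` are hypothetical and chosen only for this example.

```python
def boundary_vertices(adj, component_of):
    """Vertices that have at least one neighbor in a different component.

    adj maps u -> iterable of neighbors; component_of maps u -> component index.
    """
    return {
        u
        for u in adj
        if any(component_of[v] != component_of[u] for v in adj[u])
    }


def boundary_of_component(adj, component_of, c):
    """B(C): the boundary vertices that belong to component c."""
    return {
        u for u in boundary_vertices(adj, component_of) if component_of[u] == c
    }


# Path 0 - 1 - 2 - 3 split into components {0, 1} and {2, 3}.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
parts = {0: 0, 1: 0, 2: 1, 3: 1}
```

Here only the edge between vertices 1 and 2 crosses the partition, so the boundary is {1, 2}, with one boundary vertex in each component.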




3 Algorithm overview and analysis 


Our algorithm works in two modes: preprocessing mode, during which a data 
structure is computed that allows efficient SP queries, and the query mode that 
uses that data structure to compute the distance between a query pair of vertices. 
We assume that the input is a planar graph G of n vertices and bounded vertex 
degree and the cluster has p nodes. 

3.1 Preprocessing mode 

The preprocessing algorithm (Algorithm 1) has three phases. During the first
phase (line 1), the graph is partitioned and each component is assigned to a
distinct cluster node. During the second phase (lines 2-5), the APSP problem is
solved for each component $C$ independently and in parallel and the computed
distance matrix APSP($C$) is stored at the same node. Finally, in the third phase
(lines 6-10), the boundary graph $BG$ is constructed and the APSP problem is solved
for $BG$. That computation is distributed so that the distances from a vertex
$v \in BG$ to all other vertices of $BG$ are computed at the node containing $v$,
using Dijkstra’s algorithm [5]. The computed distance matrix is stored at the
node that has done the computations. Hence, at the end of the algorithm, the
node $N(C)$ contains two matrices: one containing the SP distances in $C$ and the
other containing all SP distances in $BG$ with source a vertex in $BG \cap C$.

One can think of $BG$ as a compressed version of $G$ where the non-boundary
vertices are removed, but are implicitly represented in $BG$ by the information
encoded in its edge weights. Note however that the distances in APSP($C$) (and the
corresponding edge weights of $BG$) are not distances in $G$; the reason is that a
shortest path between two vertices $v$ and $w$ from $C$ might pass through vertices
not in $C$. Hence the following fact is non-trivial.

Lemma 1. For any two vertices $v, w \in BG$ the distance between $v$ and $w$ in
$BG$ is equal to the distance between $v$ and $w$ in $G$.

We will next estimate the time and space (memory) required to run the algorithm.
As $G$ is planar and of bounded vertex degree (as a function of $n$), it
can be divided in $O(n)$ time into $k$ parts so that each part has $O(n/k)$
vertices and $O(\sqrt{n/k})$ boundary vertices [5]. We will estimate the requirements
of each phase. Since the maximum amount of coarse-grained parallelism
of Algorithm 1 is $\min\{p, k\}$, we assume without loss of generality that $p \le k$.

Phase 1 requires $O(n)$ running time and $O(n)$ space [5].

The complexity of Phase 2 is dominated by the time for computing distances
in line 3. We assume that we are using the algorithm from [3] that can
be implemented efficiently on a GPU-accelerated architecture and has complexity
$O(N^{9/4})$ for a graph of $N$ vertices. Then Phase 2 requires
$O((k/p)(n/k)^{9/4}) = O(n^{9/4}/(p k^{5/4}))$ time
and $k \cdot O((n/k)\sqrt{n/k}) = O(n^{3/2}/k^{1/2})$ total space. The space per processor is
$(k/p) \cdot O((n/k)^2) = O(n^2/(pk))$.

For Phase 3, the number of vertices of $BG$ is $k \cdot O(\sqrt{n/k}) = O(\sqrt{nk})$ and
the number of edges is $k \cdot O((\sqrt{n/k})^2) = O(n)$. One execution of line 8 (for one
component $C$) takes $(k/p)\,|BG \cap C|\,|E(BG)| \log(|BG|) = (k/p)\,O(\sqrt{n/k})\,O(n \log n)$
time and $O(n)$ space. The space needed for one iteration of line 9 is
$|BG \cap C|\,|BG| = O(\sqrt{n/k}\,\sqrt{nk}) = O(n)$. Hence Phase 3 requires
$O((k/p)\sqrt{n/k}\; n \log n) = O(n^{3/2} k^{1/2} \log n / p)$ time and $O(nk/p)$ space per processor.

Summing up the requirements for Phases 1, 2, and 3, we get
$O(n + n^{9/4}/(p k^{5/4}) + n^{3/2} k^{1/2} \log n / p)$ time and
$O(n + n^2/(pk) + nk/p)$ space per processor needed
for Algorithm 1. Assuming space is more important in this case than time (since
nodes have limited memory), we find that $k = n^{1/2}$ minimizes the function
$n^2/k + nk$. Hence we have the following result.

Lemma 2. With $k = n^{1/2}$ and $p \le k$, Algorithm 1 runs in $O(n^{7/4} \log n / p)$
time and uses $O(n^{3/2}/p)$ space per processor. With $p = k$, the time and space
are $O(n^{5/4} \log n)$ and $O(n)$, respectively.

The time bound of Lemma 2 is conservative as it does not take into account
our use of fine-grain parallelism due to multi-threading, e.g., by the GPUs.
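
The choice of $k$ and the resulting bounds can be checked by direct substitution. The worked derivation below treats $k$ as a continuous variable, which is an assumption made only for this illustration, and takes the per-phase terms to be $n^{9/4}/(pk^{5/4})$ and $n^{3/2}k^{1/2}\log n/p$ for time and $n^2/(pk)$ and $nk/p$ for space.

```latex
\begin{align*}
f(k) &= \frac{n^2}{k} + nk, &
f'(k) &= -\frac{n^2}{k^2} + n = 0 \;\Longrightarrow\; k = n^{1/2},\\
\left.\frac{n^{9/4}}{p\,k^{5/4}}\right|_{k=n^{1/2}} &= \frac{n^{13/8}}{p}, &
\left.\frac{n^{3/2}k^{1/2}\log n}{p}\right|_{k=n^{1/2}} &= \frac{n^{7/4}\log n}{p},\\
\left.\frac{n^{2}}{pk}\right|_{k=n^{1/2}} &= \frac{n^{3/2}}{p}, &
\left.\frac{nk}{p}\right|_{k=n^{1/2}} &= \frac{n^{3/2}}{p}.
\end{align*}
```

Since $n^{13/8} \le n^{7/4}$, the Phase 3 term dominates the running time, and since $p \le n^{1/2}$ implies $n \le n^{3/2}/p$, the space is $O(n^{3/2}/p)$. Setting $p = k = n^{1/2}$ then gives $n^{7/4}\log n / n^{1/2} = n^{5/4}\log n$ time and $n^{3/2}/n^{1/2} = n$ space per processor.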


Algorithm 1 Preprocessing algorithm
Input: A planar graph $G$
Output: A data structure for efficient shortest path queries in $G$

/* Partitioning */
1: Construct a $k$-partition $\mathcal{P}$ of $G$ and assign each component $C$ to a distinct node $N(C)$
/* Solve the APSP problem for each component */
2: for all components $C \in \mathcal{P}$ do in parallel
3:     Solve APSP for $C$ and save the distances in a table APSP($C$)
4:     For each pair of boundary vertices $v, w \in C$ define edge $(v, w)$, if not already in $C$, and assign a weight $\mathrm{wt}(v, w) = \mathrm{dist}_C(v, w)$
5: end for
/* Solve the APSP problem for the boundary graph */
6: Define a boundary graph $BG$ with vertices all boundary vertices of $G$ and edges as defined in the previous step and store it at each node
7: for all components $C \in \mathcal{P}$ do in parallel
8:     Solve SSSP in $BG$ for each vertex of $C \cap BG$
9:     Store the distances from all vertices of $C \cap BG$ to all vertices of $BG$ in a table APSP$_{BG}$($C$)
10: end for
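
The three phases can be mimicked by a minimal sequential Python sketch. It is not the distributed CPU/GPU implementation described in this paper: the components are processed one after another in ordinary loops, Dijkstra’s algorithm stands in for the per-component APSP algorithm, and the partition is supplied by the caller. The names `sssp`, `preprocess`, and `cycle` are hypothetical.

```python
import heapq
import math


def sssp(adj, source):
    """Dijkstra on adj (u -> {v: weight}); returns {vertex: distance}."""
    dist = {u: math.inf for u in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def preprocess(adj, components):
    """Sequential stand-in for Algorithm 1; components is a list of vertex sets."""
    comp_of = {u: i for i, comp in enumerate(components) for u in comp}
    boundary = {u for u in adj if any(comp_of[v] != comp_of[u] for v in adj[u])}

    # Lines 2-3: APSP inside each component (edges leaving it are ignored).
    apsp = []
    for comp in components:
        sub = {u: {v: w for v, w in adj[u].items() if v in comp} for u in comp}
        apsp.append({u: sssp(sub, u) for u in comp})

    # Lines 4 and 6: boundary graph with the original edges between boundary
    # vertices plus one edge per boundary pair of a component, weighted by dist_C.
    bg = {b: {v: w for v, w in adj[b].items() if v in boundary} for b in boundary}
    for i, comp in enumerate(components):
        for b in comp & boundary:
            for c in comp & boundary:
                if b != c and apsp[i][b][c] < bg[b].get(c, math.inf):
                    bg[b][c] = apsp[i][b][c]

    # Lines 7-10: SSSP in BG from every boundary vertex, grouped by component.
    apsp_bg = [{b: sssp(bg, b) for b in comp & boundary} for comp in components]
    return comp_of, boundary, apsp, apsp_bg


# Example: a 6-cycle with one heavy edge, split into two components.
cycle = {
    0: {1: 1.0, 5: 10.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0, 5: 1.0},
    5: {4: 1.0, 0: 10.0},
}
comp_of, boundary, apsp, apsp_bg = preprocess(cycle, [{0, 1, 2}, {3, 4, 5}])
```

In this example the boundary-graph distance between vertices 0 and 5 is 5, not the weight 10 of the direct edge, and it coincides with the distance in the full graph, as Lemma 1 states.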


3.2 Query mode 

The query algorithm (Algorithm 2) is based on the fact that if $C_1 \neq C_2$, then
any path between $v_1$ and $v_2$ must cross both $B(C_1)$ and $B(C_2)$. Let $\pi$ be a
shortest path between $v_1$ and $v_2$. Then $\pi$ can be divided into three parts: from
$v_1$ to a vertex $b_1$ from $B(C_1)$, from $b_1$ to a vertex $b_2$ on $\pi$ from $B(C_2)$, and from
$b_2$ to $v_2$. Vertices $b_1$ and $b_2$ minimizing the length of $\pi$ are found as follows: in
the loop on lines 2-7, for each $b_2$ an optimal $b_1$ and $\mathrm{dist}(v_1, b_2)$ are found; in lines
10-12 an optimal $b_2$ is found.


Algorithm 2 Query algorithm
Input: Vertices $v_1, v_2$ of $G$, a $k$-partition $\mathcal{P}$ of $G$, tables APSP($C$) and APSP$_{BG}$($C$) for all $C \in \mathcal{P}$
Output: $\mathrm{dist}(v_1, v_2)$

1: Determine components $C_1$ and $C_2$ such that $v_1 \in C_1$, $v_2 \in C_2$
2: for all vertices $b_2 \in B(C_2)$ do in parallel
       /* Compute dist(v_1, b_2) */
3:     $\mathrm{dist}(v_1, b_2) = \infty$
4:     for all vertices $b_1 \in B(C_1)$ do
5:         $\mathrm{dist}(v_1, b_2) = \min\{\mathrm{dist}(v_1, b_2),\ \mathrm{dist}_{C_1}(v_1, b_1) + \mathrm{dist}_{BG}(b_1, b_2)\}$
6:     end for
7: end for
8: If $N(C_1) \neq N(C_2)$ then transfer the column of APSP($C_2$) corresponding to $v_2$ from $N(C_2)$ to $N(C_1)$.
/* Now we can compute dist(v_1, v_2) */
9: $\mathrm{dist}(v_1, v_2) = \infty$
10: for all vertices $b_2 \in B(C_2)$ do
11:     $\mathrm{dist}(v_1, v_2) = \min\{\mathrm{dist}(v_1, v_2),\ \mathrm{dist}(v_1, b_2) + \mathrm{dist}_{C_2}(b_2, v_2)\}$
12: end for
13: If $C_1 = C_2$ then $\mathrm{dist}(v_1, v_2) = \min\{\mathrm{dist}(v_1, v_2),\ \mathrm{dist}_{C_1}(v_1, v_2)\}$, where the distance $\mathrm{dist}_{C_1}(v_1, v_2)$ is taken from APSP($C_1$).
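
The query procedure can be sketched sequentially in Python as follows. This is an illustration under simplifying assumptions, not the implementation evaluated in this paper: all tables live in one process, so the data transfer of line 8 is omitted, and the table layout (nested dictionaries named `comp_of`, `boundary_of`, `apsp`, and `apsp_bg`) is hypothetical. The tables below are written by hand for the path 0 - 1 - 2 - 3 with unit weights, split into components {0, 1} and {2, 3}.

```python
import math


def query(v1, v2, comp_of, boundary_of, apsp, apsp_bg):
    """Sequential stand-in for Algorithm 2.

    comp_of[v]        -> index of the component containing v
    boundary_of[c]    -> list of boundary vertices of component c, i.e. B(C)
    apsp[c][u][v]     -> dist_C(u, v) for u, v in component c
    apsp_bg[c][b][b2] -> dist_BG(b, b2) for b in B(C) and any boundary vertex b2
    """
    c1, c2 = comp_of[v1], comp_of[v2]

    # Lines 2-7: for each b2 in B(C2), the best route v1 -> b1 -> b2.
    to_b2 = {}
    for b2 in boundary_of[c2]:
        best = math.inf
        for b1 in boundary_of[c1]:
            best = min(best, apsp[c1][v1][b1] + apsp_bg[c1][b1][b2])
        to_b2[b2] = best

    # Lines 9-12: append the last leg b2 -> v2 inside C2.
    best = math.inf
    for b2 in boundary_of[c2]:
        best = min(best, to_b2[b2] + apsp[c2][b2][v2])

    # Line 13: within one component, a path that never leaves it may be shorter.
    if c1 == c2:
        best = min(best, apsp[c1][v1][v2])
    return best


# Hand-built tables for the path 0 - 1 - 2 - 3 (boundary vertices 1 and 2).
comp_of = {0: 0, 1: 0, 2: 1, 3: 1}
boundary_of = {0: [1], 1: [2]}
apsp = {
    0: {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}},
    1: {2: {2: 0.0, 3: 1.0}, 3: {2: 1.0, 3: 0.0}},
}
apsp_bg = {0: {1: {1: 0.0, 2: 1.0}}, 1: {2: {1: 1.0, 2: 0.0}}}
```

With these tables, `query(0, 3, comp_of, boundary_of, apsp, apsp_bg)` combines one leg inside each component with one boundary-graph edge and returns 3.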


Lemma 3. Algorithm 2 correctly computes $\mathrm{dist}(v_1, v_2)$ and its running time is
$O(n^{1/4})$ with $k = n^{1/2}$ and $p \ge n^{1/4}$.

Proof. Let $\pi$ be a shortest path between $v_1$ and $v_2$, let $C_1 \neq C_2$, and let $b_1$ be
the first vertex along $\pi$ that is on $B(C_1)$ and $\pi_1$ be the subpath of $\pi$ from $v_1$ to
$b_1$, let $b_2$ be the last vertex along $\pi$ that is on $B(C_2)$ and $\pi_2$ be the subpath of $\pi$
from $b_1$ to $b_2$, and let $\pi_3$ be the subpath of $\pi$ from $b_2$ to $v_2$. Then $\pi_1$ is entirely in
$C_1$ and hence $\mathrm{dist}_{C_1}(v_1, b_1) = \mathrm{dist}_G(v_1, b_1)$ (note, however, that the distances in
APSP($C_1$) from $v_1$ to other vertices from $B(C_1)$ may not be correct). Similarly,
$\mathrm{dist}_{C_2}(b_2, v_2) = \mathrm{dist}_G(b_2, v_2)$. Finally, $\mathrm{dist}_{BG}(b_1, b_2) = \mathrm{dist}_G(b_1, b_2)$ by Lemma 1.
Hence lines 5 and 11 use correct values for computing the distances between $v_1$
and $b_2$ and between $b_2$ and $v_2$.

If $C_1 = C_2$ (line 13), then a shortest path between $v_1$ and $v_2$ may or may not
leave $C_1$. In the first case lines 1-12 correctly compute $\mathrm{dist}(v_1, v_2)$; in the second
case APSP($C_1$) contains the correct distance.

The loop on lines 2-7 takes time $|B(C_1)|\,|B(C_2)|/p = O(\sqrt{n/k}\,\sqrt{n/k}/p) =
O(n/(pk))$, for $p \le \min\{k, (n/k)^{1/2}\}$. If $k = n^{1/2}$ and $p = n^{1/4}$ (the maximum
value for which the formula applies), that time becomes $O(n^{1/4})$. The loop in
lines 10-12 takes time $O((n/k)^{1/2}) = O(n^{1/4})$ for $k = n^{1/2}$.






Note that using the methodology of [3], a more complex implementation of
Algorithm 2 can reduce the query time to logarithmic. Note also that the computation
in lines 2-7 can be overlapped with the transfer of data in line 8, thereby
saving time (up to a factor of two).

4 Implementation details 

In this section, we describe how the preprocessing and query modes are implemented
on a hybrid CPU-GPU cluster. We use a distance matrix to represent
both the input graph $G$ and the output. Such a 2-dimensional matrix contains
in cell $(i, j)$ the value of the distance from vertex $i$ to vertex $j$. Initially, cell $(i, j)$
contains $\mathrm{wt}(i, j)$ if an edge $(i, j)$ is present in $G$, or infinity otherwise. These
values are updated as the algorithm progresses. At the end of the algorithm, cell
$(i, j)$ contains $\mathrm{dist}(i, j)$.
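
The initial state of such a matrix can be sketched in Python as below. This is an illustration only: the function name `initial_distance_matrix` and the edge-list input format are hypothetical, the graph is assumed to be undirected, and setting the diagonal to zero is an assumption of the sketch rather than something stated above.

```python
import math


def initial_distance_matrix(n, edges):
    """Build the n x n matrix: wt(i, j) for edges, infinity otherwise."""
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0  # assumption: zero distance from a vertex to itself
    for i, j, w in edges:
        # Undirected graph: store the weight in both cells, keeping the
        # lighter edge if the same pair appears more than once.
        dist[i][j] = min(dist[i][j], w)
        dist[j][i] = min(dist[j][i], w)
    return dist


# Three vertices on a path with made-up weights 2 and 4.
matrix = initial_distance_matrix(3, [(0, 1, 2.0), (1, 2, 4.0)])
```

In the example, cells (0, 1) and (1, 0) hold the weight 2, while cell (0, 2) stays at infinity until a shortest path computation updates it.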

In phase 1 of the preprocessing mode, we construct a $k$-partition of $G$ using
the METIS library [7]. Based on that partition, we reorder the vertices of $G$ so
that vertices from the same component have consecutive indices and boundary
vertices of each component have the lowest indices (see Figure 1).
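
The reordering step can be illustrated with a small Python sketch. It assumes that a partition has already been computed (for instance by METIS) and is available as a mapping from vertices to component indices; the function name `reorder_vertices` and the example data are hypothetical.

```python
def reorder_vertices(adj, component_of, k):
    """Return (order, new_index): components contiguous, boundary vertices first."""
    is_boundary = {
        u: any(component_of[v] != component_of[u] for v in adj[u]) for u in adj
    }
    order = []
    for c in range(k):
        members = [u for u in adj if component_of[u] == c]
        # False sorts before True, so boundary vertices come first; sorted()
        # is stable, so vertices otherwise keep their original relative order.
        order.extend(sorted(members, key=lambda u: not is_boundary[u]))
    new_index = {u: i for i, u in enumerate(order)}
    return order, new_index


# Path 0 - 1 - 2 - 3 split into components {0, 1} and {2, 3}.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
parts = {0: 0, 1: 0, 2: 1, 3: 1}
order, new_index = reorder_vertices(path, parts, 2)
```

For this path, vertex 1 is the boundary vertex of the first component and is therefore moved in front of vertex 0, giving the order 1, 0, 2, 3.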



Fig. 1: Distance matrix after reordering of the vertices. Vertices from the
same component are stored contiguously starting with boundary vertices.
Red submatrices are also part of the boundary distance matrix. Grey
submatrices do not generate any computations in preprocessing mode.

Fig. 2: The distances required to compute $\mathrm{dist}(v, w)$, shown in green, are
scattered in three submatrices: two diagonal ones, for component $I$ and for
component $J$, and a non-diagonal submatrix $(I, J)$.


In phase 2, we compute the shortest distances within each of the components.
For $k$ components, this phase gives a total of $k$ independent tasks that can be
executed in parallel. Computations at this phase are already balanced across
nodes as components contain roughly the same number of vertices and the APSP
algorithm from [3] ensures the same complexity with respect to the
number of nodes.

Finally, phase 3 consists in computing the shortest distances within the 
boundary graph using Dijkstra’s algorithm. Computations at this phase may 
be imbalanced between nodes for two reasons. First, the number of boundary 
vertices in two components may differ and, second, the complexity of Dijkstra’s 
algorithm does not solely depend on the number of vertices in the graph, but 
also on the number of edges, which may vary even more than the number of 
vertices between two components’ boundary graphs. 

In the query mode, we are interested in finding $\mathrm{dist}(v, w)$, where $v$ and $w$ are
from components $I$ and $J$, respectively. The required values for that computation
are scattered in three submatrices, as illustrated in Figure 2. For such a query,
assuming $k = p$, node $i$, holding the required values from diagonal submatrix
$I$ and non-diagonal submatrix $(I, J)$, will be in charge of the computations.
Required values from diagonal submatrix $J$ are held by node $j$ and need to be
transferred to node $i$.
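
Once the three pieces are available on one node, the query reduces to a two-stage minimum over boundary vertices. The Python sketch below illustrates that combination for two different components; the function name `query_from_blocks`, the list-based layout, and the numeric values are made up for the illustration and do not come from our implementation.

```python
def query_from_blocks(row_v, block_ij, col_w):
    """min over b1, b2 of row_v[b1] + block_ij[b1][b2] + col_w[b2].

    row_v[b1]        -> distance inside component I from v to its b1-th boundary vertex
    block_ij[b1][b2] -> boundary-graph distance between boundary vertices of I and J
    col_w[b2]        -> distance inside component J from its b2-th boundary vertex to w
    """
    # First stage: best cost of reaching each boundary vertex of J from v.
    reach = [
        min(row_v[b1] + block_ij[b1][b2] for b1 in range(len(row_v)))
        for b2 in range(len(col_w))
    ]
    # Second stage: add the last leg inside component J.
    return min(reach[b2] + col_w[b2] for b2 in range(len(col_w)))


# Two boundary vertices per component, made-up distances for illustration.
row_v = [1.0, 4.0]
block_ij = [[5.0, 2.0], [1.0, 7.0]]
col_w = [3.0, 1.0]
answer = query_from_blocks(row_v, block_ij, col_w)
```

Here the cheapest combination leaves component $I$ through its first boundary vertex and enters component $J$ through its second one, for a total of 1 + 2 + 1 = 4. Only `col_w` has to be transferred from node $j$; the other two inputs are already held by node $i$.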


5 Experimental evaluation 

In this section we describe experiments designed to test our algorithm and its 
implementation. Specifically, we are going to test the strong scaling properties by 
running our code on a fixed graph size and a varying number p of cluster nodes 
and number k of components. All computations are run on a 300 node cluster. 
Each cluster node comprises two eight-core Intel Xeon E5-2670 processors @
2.6 GHz and two Nvidia Tesla M2090 GPGPU cards connected to PCIe-2.0 x16
slots. In order to make full use of the available GPUs, each node is assigned at
least two graph components so that the two associated diagonal submatrices can 
be computed simultaneously on the two GPUs. 

For the strong-scaling experiment, the graph size is fixed to 256k vertices.
Preprocessing and queries are run with increasing numbers of nodes ranging from 
4 to 64. Each node handles 2 components (one per available GPU); therefore the 
number of components k ranges from 8 to 128. 

Figure 3 shows the run times for the preprocessing mode. For low numbers
of nodes and thus low values of $k$, preprocessing time is dominated by phase 2
(the computation of the shortest distances within each component), since lower $k$
values mean larger components. For higher numbers of nodes and thus higher
values of $k$, preprocessing time becomes dominated by phase 3 (the computation
of the boundary graph), as more components mean higher numbers of incident
edges and thus larger boundary graphs. Note that while the figure seems to show
superlinear speedup, that is not the case (and similarly for the memory usage).
The reason is that, as the number of processors $p$ increases, the number $k$ of
parts increases too (as it is tied to $p$ in this implementation) and hence the
complexity of the algorithm is also reduced.



Fig. 3: Preprocessing run times for a
fixed graph size of 256k vertices and
increasing number of nodes.


Fig. 4: Peak memories and run times for 
10k queries for a fixed graph size of 
256k vertices and increasing number of 
parts/processors. 


Figure 4 shows the query times and peak memory usage per node. The run
times are given for 10,000 queries from random sources to random targets. Note
that in the query mode only fine-grain (node-level) parallelism is used, while
multiple nodes are still needed for distributed storage and, optionally, to handle
multiple queries in parallel (not implemented in the current version). For the
memory usage, the optimal value for $k$, theoretically expected to be $\sqrt{n}$ (or 512
for this instance), is not reached in this experiment since $k$ only goes up to 128.
We can however see that peak memory usage per node is still dropping with
increasing values of $k$ up to 128. The query times in the figure vary from about 2
milliseconds per query for $k = 8$ to 0.25 milliseconds for $k = 128$. Compared with
the Boost library implementation of Dijkstra’s algorithm, our implementation
answers queries on the largest instances about 1000 times faster.

6 Conclusion 

We developed and implemented a distributed algorithm for shortest path queries
in planar graphs with good scalability. It allows answering SP queries in $O(n^{1/4})$
time by using $O(\sqrt{n})$ processors with $O(n)$ space per processor and $O(n^{5/4} \log n)$
preprocessing time. Our implementation on a 300-node CPU-GPU cluster has a
preprocessing time of less than 10 seconds using 32 or more nodes and 0.025
milliseconds per query using two nodes. Interesting tasks for future research are
implementing a version allowing parallel queries and reducing the query time of
the implementation to $O(\log n)$ by using properties of graph planarity.